Abstract: It is of increasing importance to develop learning methods
for ranking. In contrast to many learning objectives, however, the
ranking problem presents difficulties due to the fact that the space of 
permutations is not smooth. In this paper, we examine the class of 
rank-linear objective functions, which includes popular metrics such 
as precision and discounted cumulative gain. In particular, we observe 
that expectations of these gains are completely characterized by the 
marginals of the corresponding distribution over permutation matrices. 
Thus, the expectations of rank-linear objectives can always be described 
through locations in the Birkhoff polytope, i.e., doubly-stochastic matrices
(DSMs). We propose a technique for learning DSM-based ranking
functions using an iterative projection operator known as Sinkhorn 
normalization. Gradients of this operator can be computed via back- 
propagation, resulting in an algorithm we call Sinkhorn propagation, 
or SinkProp. This approach can be combined with a wide range of 
gradient-based approaches to rank learning. We demonstrate the utility 
of SinkProp on several information retrieval data sets. 

1. Introduction 

The task of ranking is straightforward to state: given a query and a set of 
documents, produce a "good" ordering over the documents based on features 
of the documents and of the query. From the point of view of supervised 
machine learning, we define the "goodness" of a ranking in terms of graded 
relevances of the documents to the query, using an objective function
that rewards orderings that rank the more relevant documents highly. The 
task is to take a training set of queries in which the documents are labeled 
with known relevances, and build an algorithm that is capable of producing 
orderings over queries in which the relevance labels are unknown. 

There are several aspects of this problem, however, that make it difficult 
to train a ranking algorithm and which have recently led to a flurry of 
insightful, creative approaches. The first thorny aspect of the learning-to- 
rank problem is that, unlike typical supervised classification and regression 
problems, we wish to learn a family of functions with varying domain and 
range. This difficulty arises from the property that queries may have varying 
numbers of documents; the set of input features scales with the number of 
documents and the size of the output ordering also changes. 

The second difficulty in learning to rank is that the space of permutations 
grows rapidly as a function of the number of documents. A fully Bayesian 
decision-theoretic approach to the ranking problem would construct a rich 
joint distribution over the relevances of documents in a query and then
select the optimal ordering, taking this uncertainty into account. While this 
is feasible for small queries, this optimization quickly becomes intractable. 
It is therefore more desirable to develop algorithms that can directly produce
permutations, rather than relevances. 

Finally, any objective defined in terms of orderings over relevance-labeled 
documents is necessarily piecewise-constant with respect to the parameters 
of the underlying function. That is, it may not be possible to compute 
the gradient of a training gain in terms of the parameters that need to be 
learned. Without such a gradient, many of the powerful machine learning 
function approximation tools, such as neural networks, become infeasible to 
train. 

In this paper, we develop an end-to-end framework for supervised gradient- 
based learning of ranking functions that overcomes each of these difficulties. 
We examine the use of doubly-stochastic matrices as differentiable relaxations 
of permutation matrices, and define popular ranking objectives in those 
terms. We show that the expected value of an important class of ranking
gains is completely preserved under the doubly-stochastic interpretation.
We further demonstrate that it is possible to propagate gradient information
backwards through an incomplete Sinkhorn normalization to allow learning 
of doubly-stochastic matrices. We also show how this leads to flexibility 
in the choice of pre-normalization matrices and enables our approach to 
be integrated with other powerful ranking methods, even as the number
of per-query documents varies. We note in particular that this approach is
well-suited to take advantage of the recent developments in the training of 
deep neural networks. 

2. Optimizing Expected Ranking Objectives via 
Doubly-Stochastic Matrices 

In a learning-to-rank problem, the training data are $N$ sets, called queries,
the $n$th of which has size $J_n$. The items within each of these query sets
are called documents. The features of document $j$ in query $n$ are denoted
as $x_j^{(n)} \in \mathcal{X}$. In the training set, each document also has a
relevance label $r_j^{(n)} \in \{0, 1, 2, \ldots, R\}$, where $R$ indicates the
maximum relevance. We denote the vector of relevances for query $n$
as $r^{(n)}$.

A ranking of the documents for a given query can be represented as a 
permutation, mapping each document j to a rank k. Each permutation 
is an element of $S_J$, the symmetric group of degree $J$. The aim then is
to learn a family of functions that output permutations: $f_J : \mathcal{X}^J \to S_J$,
for $J \in \{1, 2, \ldots\}$, where, as above, $\mathcal{X}$ is a set of features associated with
each of the $J$ items.

2.1. Ranking Objective Functions 

The most critical component when learning to rank is the definition of an 
objective function which identifies "good" and "bad" orderings of documents. 
For a query of size $J$ with known relevances, we identify three well-studied
scoring functions over the group $S_J$: the order-$K$ normalized discounted
cumulative gain (NDCG@K) [1], the order-$K$ precision (P@K) [2], and the
rank biased precision (RBP) [3].

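Writing $s[k]$ to indicate the index of the document at rank $k$ in permutation $s$,
NDCG@K is given by

$$\mathcal{L}_{\mathrm{NDCG@}K}(s; r) = \frac{\mathrm{DCG@}K(s; r)}{\max_{s'} \mathrm{DCG@}K(s'; r)},$$

$$\mathrm{DCG@}K(s; r) = \sum_{k=1}^{K} \mathcal{G}(r_{s[k]}) \, \mathcal{D}(k),$$

$$\mathcal{G}(r) = 2^{r} - 1,$$

$$\mathcal{D}(k) = \begin{cases} 1 & \text{if } k = 1, 2 \\ 1 / \log_2(k) & \text{if } k > 2. \end{cases}$$

The P@K objective, for binary relevances, is

$$\mathcal{L}_{\mathrm{P@}K}(s; r) = \frac{1}{K} \sum_{k=1}^{K} r_{s[k]},$$

and RBP is

$$\mathcal{L}_{\mathrm{RBP}}(s; r) = (1 - \alpha) \sum_{k=1}^{J} r_{s[k]} \, \alpha^{k-1},$$

where $\alpha \in [0, 1]$ is a "persistence" parameter. The natural training objective,
then, is to find a family of functions $\mathcal{F} = \{f_J : J \in \{1, 2, \ldots\}\}$ that
maximizes the aggregate empirical gain, subject to some regularization
penalty $Q(\mathcal{F})$:

$$\mathcal{F}^\star = \operatorname*{argmax}_{\mathcal{F}} \; -Q(\mathcal{F}) + \sum_{n=1}^{N} \mathcal{L}\left( f_{J_n}\left(\{x_j^{(n)}\}_{j=1}^{J_n}\right) ; \{r_j^{(n)}\}_{j=1}^{J_n} \right). \quad (1)$$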
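As a concrete illustration of the three gains in this section, the following is a minimal Python sketch. It assumes the discount is 1 for the first two ranks and $1/\log_2(k)$ afterwards, and that a permutation is stored as a zero-indexed list `s` with `s[k-1]` the document at rank $k$. The function names are illustrative and do not come from the paper.

```python
import math


def gain(rel):
    # Exponential gain G(r) = 2^r - 1.
    return 2.0 ** rel - 1.0


def discount(k):
    # Rank discount for 1-indexed rank k (assumed form: 1 for k <= 2, else 1/log2(k)).
    return 1.0 if k <= 2 else 1.0 / math.log2(k)


def dcg_at_k(s, r, K):
    # s[k-1] is the index of the document placed at rank k.
    return sum(gain(r[s[k - 1]]) * discount(k) for k in range(1, K + 1))


def ndcg_at_k(s, r, K):
    # The ideal ordering sorts documents by decreasing relevance.
    ideal = sorted(range(len(r)), key=lambda j: -r[j])
    best = dcg_at_k(ideal, r, K)
    return dcg_at_k(s, r, K) / best if best > 0 else 0.0


def precision_at_k(s, r, K):
    # Fraction of relevant documents in the top K (binary relevances).
    return sum(r[s[k]] for k in range(K)) / K


def rbp(s, r, alpha):
    # Rank biased precision; alpha ** k with zero-indexed k equals alpha^(rank - 1).
    return (1.0 - alpha) * sum(r[s[k]] * alpha ** k for k in range(len(s)))
```

Each function is piecewise-constant in any scores that produce `s`, which is the difficulty the relaxation in the next section addresses.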



2.2. Doubly-Stochastic Matrices as Marginals Over
Permutations 

As discussed in Section 1, one of the difficulties with the objective in Eq. (1) 
is that it is discontinuous. That is, changes in $f_J$ affect the training
objective only via the discrete ordering, so the objective is piecewise-constant
modulo the regularization $Q(\mathcal{F})$. One way to address this is to replace the objectives
of the previous section with expectations of these objectives, where the
expectations are with respect to a distribution over rankings [4]. That is,
given a distribution over permutations, denoted by $\psi(s)$, and the relevances $r$,
the expected gain is

$$\mathbb{E}_\psi[\mathcal{L}(r)] = \sum_{s \in S_J} \psi(s) \, \mathcal{L}(s; r). \quad (2)$$
Many ranking objectives, however, are characterized by element-wise sums
over the associated permutation matrix $S$, i.e., $S_{j,k} = \delta_{j,s[k]}$, where $\delta_{j,k}$ is
the Kronecker delta function. We call such objectives rank-linear, and they
have the general form

$$\mathcal{L}(s; r) = \sum_{j=1}^{J} \sum_{k=1}^{J} S_{j,k} \, \ell(r_j, k).$$

This has been observed previously in the literature, e.g., [5, 6]. One remarkable
aspect of rank-linear objectives, however, is that the expectation in
Eq. (2) is completely captured by the marginal probability that $S_{j,k} = 1$:

$$\mathbb{E}_\psi[\mathcal{L}(r)] = \sum_{j=1}^{J} \sum_{k=1}^{J} \ell(r_j, k) \, \mathbb{E}_\psi[S_{j,k}]. \quad (3)$$

The entry-wise marginal distributions necessary to describe this expectation 
form a doubly-stochastic matrix (DSM). A DSM is a nonnegative square 
matrix in which each column and each row sum to one. Permutation matrices 
are special cases of doubly-stochastic matrices. By Birkhoff's theorem, every
doubly-stochastic matrix can be expressed as a convex combination of at
most $J^2 - 2J + 2$ permutation matrices (see, e.g., [7]). It is natural, then, to think
of a DSM as a relaxation of a permutation that incorporates precisely the 
uncertainty that is appropriate for rank-linear gains. That is, the jth row of 
a DSM provides a properly-normalized marginal distribution over the rank 
of item j. As the columns must also sum to one, this set of J distributions is 
consistent in the sense that they cannot, e.g., attempt to place all documents 
in the first position. While the DSM does not indicate which permutations 
have non-zero probability, it does provide all the information that is necessary 
to determine the expected gain under rank-linear objectives. 
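The claim that marginals suffice can be checked numerically on a small example. The sketch below is an illustration only: the distribution `psi`, the helper names, and the DCG-style per-entry gain `dcg_term` are assumptions made for the example. It builds the marginal matrix from an explicit distribution over permutations and compares the brute-force expectation with the one computed from the marginals alone.

```python
import math


def dcg_term(rel, k):
    # Example rank-linear term l(r_j, k) for a 1-indexed rank k (DCG-style, assumed).
    disc = 1.0 if k <= 2 else 1.0 / math.log2(k)
    return (2.0 ** rel - 1.0) * disc


def marginals(psi, J):
    # psi maps a permutation tuple s (s[k-1] = document at rank k) to its probability.
    # Returns Pi with Pi[j][k-1] = probability that document j sits at rank k.
    Pi = [[0.0] * J for _ in range(J)]
    for s, p in psi.items():
        for k, j in enumerate(s, start=1):
            Pi[j][k - 1] += p
    return Pi


def expected_gain_bruteforce(psi, r, ell):
    # Direct expectation over permutations.
    return sum(
        p * sum(ell(r[j], k) for k, j in enumerate(s, start=1))
        for s, p in psi.items()
    )


def expected_gain_marginals(Pi, r, ell):
    # Expectation computed from the entry-wise marginals only.
    J = len(r)
    return sum(Pi[j][k - 1] * ell(r[j], k) for j in range(J) for k in range(1, J + 1))
```

For any `psi` that sums to one, `marginals` returns a doubly-stochastic matrix, and the two expectations agree up to floating-point error.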

We leverage this interpretation of DSMs as providing a set of consistent
marginal rank distributions in the expected ranking objective. Let
$\Pi \in [0, 1]^{J \times J}$ be a doubly-stochastic matrix. We interpret entry $\Pi_{j,k}$ as
the marginal probability that item $j$ is at rank $k$. We use this to define a
differentiable objective function

$$\mathbb{E}_\Pi[\mathcal{L}_{\mathrm{NDCG@}K}(r)] = \frac{\sum_{j=1}^{J} \sum_{k=1}^{K} \Pi_{j,k} \, \mathcal{G}(r_j) \, \mathcal{D}(k)}{\max_{s'} \mathrm{DCG@}K(s'; r)} \quad (4)$$

that is the expectation of NDCG@K under an (unknown) distribution over
permutations that has the marginals described by $\Pi$. The expected P@K
can be defined similarly:

$$\mathbb{E}_\Pi[\mathcal{L}_{\mathrm{P@}K}(r)] = \frac{1}{K} \sum_{j=1}^{J} \sum_{k=1}^{K} r_j \, \Pi_{j,k},$$

as can the expectation of rank biased precision:

$$\mathbb{E}_\Pi[\mathcal{L}_{\mathrm{RBP}}(r)] = (1 - \alpha) \sum_{j=1}^{J} r_j \sum_{k=1}^{J} \Pi_{j,k} \, \alpha^{k-1}.$$
Note that all three of these expectations recover their original counterparts
when $\Pi$ is a permutation matrix, which could then be thought of as providing
a set of degenerate marginals. 
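The expected NDCG@K of Eq. (4) is straightforward to evaluate for a given doubly-stochastic matrix. The sketch below is a minimal illustration under the same assumed gain and discount as above ($2^r - 1$, and 1 for ranks 1 and 2 with $1/\log_2(k)$ afterwards); `expected_ndcg_at_k` is a hypothetical name, and `Pi[j][k-1]` is taken to be the marginal probability that document $j$ is at rank $k$.

```python
import math


def expected_ndcg_at_k(Pi, r, K):
    # Expected NDCG@K under the marginals in the doubly-stochastic matrix Pi.
    J = len(r)

    def gain(rel):
        return 2.0 ** rel - 1.0

    def disc(k):
        return 1.0 if k <= 2 else 1.0 / math.log2(k)

    # Numerator: sum over documents j and ranks k <= K of Pi[j][k] * G(r_j) * D(k).
    num = sum(
        Pi[j][k - 1] * gain(r[j]) * disc(k)
        for j in range(J)
        for k in range(1, K + 1)
    )
    # Denominator: DCG@K of the ideal ordering (relevances sorted decreasingly).
    ideal = sorted(r, reverse=True)
    den = sum(gain(ideal[k - 1]) * disc(k) for k in range(1, K + 1))
    return num / den if den > 0 else 0.0
```

When `Pi` is the permutation matrix of the ideal ordering the function returns 1, and a uniform matrix with all entries $1/J$ gives a value strictly between 0 and 1 whenever some document is relevant.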



2.3. Choosing a Single Permutation 

Although we have defined differentiable rank-linear objectives in terms of
doubly-stochastic matrices, our test-time task remains the same: we must
produce a single ordering. Under our "consistent marginals" interpretation
of doubly-stochastic matrices, the natural objective is the one that maximizes
the log likelihood:

$$s^\star = \operatorname*{argmax}_{s \in S_J} \sum_{k=1}^{J} \ln \Pi_{s[k],k}. \quad (5)$$

This maximization is a bipartite matching problem and can be solved
in $O(J^3)$ time using the Hungarian algorithm [8]. For queries of more than a
few hundred documents, this cubic time complexity becomes a bottleneck. To
speed up selection of a permutation, we use a "short-cut" bipartite matching
scheme that uses a quadratic algorithm to compute a global ordering and
then uses the Hungarian algorithm on only the top $P < J$ documents under
the global ordering, resulting in $O(J^2 + P^3)$ running time.

In the first pass, we compute the expected rank of each document under
the marginal distributions implied by the doubly-stochastic matrix:

$$\bar{k}_j = \sum_{k=1}^{J} k \, \Pi_{j,k}.$$

These expected ranks can be sorted to compute an ordering $s$ for all $J$
documents. In the second phase, the Hungarian algorithm is applied to the
top $P$ documents and the submatrix $\Pi_{s[1:P],\,1:P}$ using the score in Eq. (5).
This provides an improved ordering for the top P documents, while keeping 
the remainder of the permutation fixed. 
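The two-pass procedure can be sketched as follows. This is an illustration rather than the implementation used in the experiments: to keep the example dependency-free, the second pass uses an exhaustive search over the $P!$ assignments as a stand-in for the Hungarian algorithm, so it is only suitable for very small $P$. The function names and the floor of `1e-300` inside the logarithm are assumptions of the sketch.

```python
import itertools
import math


def expected_ranks(Pi):
    # Expected (1-indexed) rank of each document under its row of marginals.
    return [sum((k + 1) * p for k, p in enumerate(row)) for row in Pi]


def best_assignment(sub):
    # Exhaustive stand-in for the Hungarian algorithm (assumption of this sketch).
    # sub[a][k] is the probability that candidate a takes rank k + 1.
    P = len(sub)

    def log_lik(perm):
        return sum(math.log(sub[perm[k]][k]) for k in range(P))

    return max(itertools.permutations(range(P)), key=log_lik)


def shortcut_ranking(Pi, P):
    # First pass: global ordering by expected rank.
    ranks = expected_ranks(Pi)
    order = sorted(range(len(Pi)), key=lambda j: ranks[j])
    # Second pass: exact matching restricted to the top P documents and ranks.
    top = order[:P]
    sub = [[max(Pi[j][k], 1e-300) for k in range(P)] for j in top]
    perm = best_assignment(sub)
    return [top[a] for a in perm] + order[P:]
```

In practice the exhaustive search would be replaced by a cubic-time assignment solver, for example `scipy.optimize.linear_sum_assignment` applied to the negated log-probabilities.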

While this procedure offers no theoretical guarantees for the optimality of 
the global matching, it is well-suited to the ranking problem. First, metrics 
such as NDCG are most heavily influenced by the top-ranked documents 
in the ordering; focusing the expensive computations on this subset is a 
sensible heuristic. Second, the main way that this procedure would act 
pathologically is if the row-wise distributions were highly multimodal, i.e.,
significant mass split between very high and very low ranks. For the types of 
functions we explore in Section 4, however, this kind of behavior is unlikely. 

3. Learning to Rank with Sinkhorn Gradients 

Fig 1: Hinton diagrams of three iterations of Sinkhorn normalization: (a) initial
matrix, (b) first iteration, (c) second iteration, (d) third iteration. The
row and column sums are shown as squares on the right and bottom of the
box. The matrix quickly balances: in (d) the row- and column-wise sums
are almost indistinguishable.

Having relaxed our objective functions from permutations to doubly-stochastic
matrices, we are no longer seeking to learn functions $f_J : \mathcal{X}^J \to S_J$ (which
output permutations), but instead a family of functions $g_J : \mathcal{X}^J \to \mathcal{W}_J$,
where $\mathcal{W}_J$ is the set of $J \times J$ doubly-stochastic matrices (the Birkhoff
polytope). Such functions are difficult to construct, however, as it is not
straightforward to parameterize doubly-stochastic matrices. This is in contrast to,
for example, right-stochastic matrices for which we can easily construct a 
surjective function using a row-wise softmax. There is, however, an iterative 
projection procedure known as a Sinkhorn normalization [9], which takes a 
nonnegative square matrix and converts it to a doubly-stochastic matrix 
by repeatedly normalizing rows and columns. More formally, we define row 
and column normalization functions: 

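$$T_R(A) = A \oslash (A \mathbf{1} \mathbf{1}^{\mathsf{T}}), \qquad T_C(A) = A \oslash (\mathbf{1} \mathbf{1}^{\mathsf{T}} A),$$

where $\oslash$ is the Hadamard (elementwise) division and $\mathbf{1}$ is a vector of ones.
We can then define the iterative function

$$Z^i(A) = \begin{cases} A & \text{if } i = 0 \\ T_R(T_C(Z^{i-1}(A))) & \text{otherwise.} \end{cases}$$

The function $Z^\infty : \mathbb{R}_+^{J \times J} \to \mathcal{W}_J$ is a Sinkhorn normalization operator and,
when it converges, it produces a doubly-stochastic matrix.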
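The alternating normalization can be written in a few lines. The sketch below is a plain-Python illustration operating on lists of lists; the function names are chosen for the example, and each iteration applies a column normalization followed by a row normalization, matching the recursion above.

```python
def row_normalize(A):
    # Divide every entry by the sum of its row.
    out = []
    for row in A:
        total = sum(row)
        out.append([a / total for a in row])
    return out


def col_normalize(A):
    # Divide every entry by the sum of its column.
    J = len(A)
    col_sums = [sum(A[j][k] for j in range(J)) for k in range(J)]
    return [[A[j][k] / col_sums[k] for k in range(J)] for j in range(J)]


def sinkhorn(A, iters):
    # Sinkhorn normalization truncated after `iters` column-then-row passes.
    Z = [row[:] for row in A]
    for _ in range(iters):
        Z = row_normalize(col_normalize(Z))
    return Z
```

For a strictly positive input the rows sum to one exactly after every pass, and the column sums approach one as `iters` grows.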

Sinkhorn and Knopp [10] presented necessary and sufficient conditions for 
convergence of this procedure to a unique limit. In summary, convergence is 
guaranteed for most non-negative matrices, but some non-negative matrices 
cannot be converted into doubly stochastic matrices because of their pattern 
of zeros (expressed as conditions on zero minors in the original matrix). 
$O(V |\log \epsilon|)$ Sinkhorn steps suffice to reach $\epsilon$-near double stochasticity if all
the matrix entries are in $[1, V]$ [11]. Hence most applications of Sinkhorn
normalization smooth the original matrix to facilitate convergence. 

In this paper, we introduce the idea of an incomplete Sinkhorn normalization,
which is any of the functions $Z^i(A)$ for $i < \infty$. The incomplete Sinkhorn
normalization allows us to define the objective functions of the previous section
in terms of square matrices whose only constraints are that the entries
must be nonnegative. That is, if we can now produce a matrix in $\mathbb{R}_+^{J \times J}$ for
each training query, this can be approximately normalized and the ranking
functions can be evaluated. Most interestingly, however, it is possible to
compute the gradient of the training objective with regard to the initial unnormalized
matrix. This can be done efficiently by propagating the gradient
backward through the sequence of row and column normalizations, as in
discriminative neural network training. We refer to this backpropagation as
SinkProp.

As in backpropagation, in SinkProp we proceed through layers of computation
and we assume that the output of each layer provides the input
to a scalar function for which we are computing the gradient in terms of
the layer's inputs. This function is here denoted $U(A')$ and we assume that
the gradient $\partial U(A') / \partial A'$ has already been computed. We can use this to
compute the gradient of a row normalization via

$$\frac{\partial}{\partial A_{j,k}} U(T_R(A)) = \sum_{k'=1}^{J} \left[ \frac{\delta_{k,k'}}{\sum_{k''=1}^{J} A_{j,k''}} - \frac{A_{j,k'}}{\left( \sum_{k''=1}^{J} A_{j,k''} \right)^2} \right] \frac{\partial U(A')}{\partial A'_{j,k'}}.$$



The gradient of the column normalization is essentially identical, modulo appropriate
transposition of indices. The function $U(\cdot)$ here corresponds to the
composition of any number of Sinkhorn normalizations with the rank-linear 
objective. This enables us to use a small amount of code to backpropagate 
through an arbitrary-depth incomplete Sinkhorn normalization. 
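The backward pass through an incomplete Sinkhorn normalization can be sketched as below. This is a minimal illustration and not the authors' code: all function names are hypothetical, matrices are lists of lists, and `grad_of_objective` stands for any routine returning the gradient of the scalar objective with respect to the normalized matrix. The row-normalization gradient follows from differentiating $A'_{j,k'} = A_{j,k'} / \sum_{k''} A_{j,k''}$, and the column case is obtained by transposition.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def row_normalize(A):
    out = []
    for row in A:
        total = sum(row)
        out.append([a / total for a in row])
    return out


def col_normalize(A):
    return transpose(row_normalize(transpose(A)))


def row_normalize_backward(A, grad_out):
    # Given dU/dA' for A' = row_normalize(A), return dU/dA.
    grad_in = []
    for row, g in zip(A, grad_out):
        total = sum(row)
        weighted = sum(a * gk for a, gk in zip(row, g))
        grad_in.append([gk / total - weighted / total ** 2 for gk in g])
    return grad_in


def col_normalize_backward(A, grad_out):
    # Same rule with the roles of rows and columns exchanged.
    return transpose(row_normalize_backward(transpose(A), transpose(grad_out)))


def sinkprop(A, iters, grad_of_objective):
    # Forward pass, caching the input of every normalization step.
    cache = []
    Z = A
    for _ in range(iters):
        C = col_normalize(Z)
        cache.append((Z, C))
        Z = row_normalize(C)
    # Backward pass through the steps in reverse order.
    grad = grad_of_objective(Z)
    for Z_in, C in reversed(cache):
        grad = row_normalize_backward(C, grad)
        grad = col_normalize_backward(Z_in, grad)
    return Z, grad
```

A finite-difference comparison on a small positive matrix is a convenient sanity check for a routine of this kind. For a rank-linear objective, `grad_of_objective` simply returns the fixed matrix of per-entry gains.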



4. Parameterizing the Pre-Sinkhorn Matrix 

We have so far defined a differentiable training function that allows us to 
optimize a rank-linear objective in terms of an unconstrained nonnegative 
square matrix. The final piece of our framework is to define a family of
functions $h_J : \mathcal{X}^J \to \mathbb{R}_+^{J \times J}$, for $J \in \{1, 2, \ldots\}$. There are several approaches
that could be taken to this problem. We will focus on two examples in
which the functions $h_J$ can be computed from $J$ evaluations of a single
function $\phi : \mathcal{X} \to \mathbb{R}^D$.

4.1. Partitioned Probability Measures

One approach to the functions $h_J$ is to construct them in terms of parametric
probability densities $\pi(u \mid \theta)$ defined on $(0, 1)$. The parameters $\theta$ are
taken to be in $\mathbb{R}^D$ for all $J$, so that they may be computed via $\phi(x)$. For
any $J$, we can define $J$ equally-spaced bins in $(0, 1)$, with edges $\rho_J[j] = j/J$,
for $j \in \{0, 1, \ldots, J\}$. We then define the "output matrix" from $h_J$ to be

$$A_{j,k} = \int_{\rho_J[k-1]}^{\rho_J[k]} \pi(u \mid \theta_j = \phi(x_j)) \, du. \quad (6)$$

In this construction, each row of the matrix arises by subdividing the
mass of the row-specific cumulative probability distribution function whose
parameterization is determined by $\phi(x_j)$. Intuitively, this means that for any
given row, the more mass that appears near zero, the greater the preference
for that document appearing higher in the ranking. Reasonable choices
for $\pi(\cdot)$ include the beta distribution, the probit, and the logit-logistic (LL):



Both the PDF and CDF of the LL distribution are fast to compute. This 
overall construction is appealing as it is differentiable, and is a natural 
approach to regression of variable-sized square matrices. 
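The binning construction of Eq. (6) reduces to differences of a CDF at the bin edges. The sketch below is an illustration under a stated assumption: it takes the distribution on $(0, 1)$ to be one whose logit is logistic with location `mu` and scale `s`, so that the CDF is a logistic sigmoid of $(\mathrm{logit}(u) - \mu)/s$. This parameterization and the function names are assumptions of the example and may differ from the form used in the paper.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ll_cdf(u, mu, s):
    # Assumed CDF on (0, 1): logistic sigmoid of the shifted, scaled logit of u.
    if u <= 0.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return _sigmoid((math.log(u / (1.0 - u)) - mu) / s)


def partitioned_row(mu, s, J):
    # Mass of the distribution in each of J equal-width bins with edges j / J.
    edges = [j / J for j in range(J + 1)]
    return [ll_cdf(edges[k + 1], mu, s) - ll_cdf(edges[k], mu, s) for k in range(J)]


def pre_sinkhorn_matrix(params):
    # params[j] = (mu_j, s_j), e.g. produced by phi(x_j) with D = 2 (assumption).
    J = len(params)
    return [partitioned_row(mu, s, J) for mu, s in params]
```

Each row sums to one by construction, so the result is right-stochastic before any Sinkhorn iterations; a strongly negative `mu` concentrates mass in the first bins, which corresponds to a preference for high ranks.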

4.2. Smoothed Indicator Functions

One deficiency of the approach of the previous section is that there is no 
interaction between the documents except through the Sinkhorn iterations. 
An alternative approach, inspired by SmoothRank [6], is to construct the 
pre-Sinkhorn matrix via: 



where $D = 1$ and the permutation $s$ arises from sorting the $\phi(x_j)$. In the
limit of $\sigma \to 0$ this recovers the permutation matrix implied by $s$. In the
limit $\sigma \to \infty$, the matrix becomes all ones. As discussed in [6], this function
is continuous but not differentiable. However, it is almost everywhere differentiable,
as the discontinuities in the first derivatives only occur at ties
between the $\phi(x_j)$. This has no practical effect for optimization purposes.
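To make the two limits concrete, the following sketch assumes a SmoothRank-style kernel, $A_{j,k} = \exp(-(\phi(x_j) - \phi(x_{s[k]}))^2 / \sigma)$, with $s$ obtained by sorting the scores in decreasing order. Both the kernel and the sort direction are assumptions of the example, chosen because they reproduce the limiting behavior described in the text; the function name is hypothetical.

```python
import math


def smoothed_indicator_matrix(scores, sigma):
    # scores[j] plays the role of the scalar phi(x_j) (D = 1).
    J = len(scores)
    # s[k] is the document at rank k + 1 when sorting by decreasing score (assumed).
    s = sorted(range(J), key=lambda j: -scores[j])
    return [
        [math.exp(-((scores[j] - scores[s[k]]) ** 2) / sigma) for k in range(J)]
        for j in range(J)
    ]
```

With distinct scores and a very small `sigma` the result is numerically the permutation matrix of `s`, while a very large `sigma` gives a matrix of near-ones, which Sinkhorn normalization maps to the uniform doubly-stochastic matrix.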

In both of these examples, it is the function $\phi(\cdot)$ that is of primary
interest, and whose parameters are being optimized during training. Assuming
that the document features are $M$-dimensional real-valued vectors, i.e.,
$\mathcal{X} = \mathbb{R}^M$, then a simple linear function is a reasonable base case: $\phi(x) = W^{\mathsf{T}} x$,
where $W \in \mathbb{R}^{M \times D}$. More interesting is the possibility of using deep neural
networks and other more sophisticated function approximation tools. 

5. Empirical Evaluation on LETOR Data 

In this section we report on comparisons with other approaches to ranking, 
using seven data sets associated with the LETOR 3.0 [12] benchmark. Following
the standard LETOR procedures, we trained over five folds, each with 
distinct training, validation and test splits. We used the training objective 
of Eq. (4), with $K$ set to be the number of documents in the largest query.
This achieved the best performance in our experiments, similar to the results 
reported in [6]. These experiments used the smoothed indicator function 
approach described in Section 4.2. Optimization was performed using
L-BFGS$^1$, annealing the smoothing constant $\sigma$ as in [6]. We initialized with the
MLE regression weights under squared loss. Early stopping was performed 
based on the NDCG of predictions on the validation set. We regularized the 
weights using the $\ell_1$ distance to the MLE regression weights, selecting the

1 http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html
regularization penalty using the validation set. To reduce training time, the
training data were resampled into a larger number of smaller queries. Each 
initial query was turned into twenty derived queries whose documents were 
sampled with replacement from the original. The number of documents in 
each derived query was Poisson distributed, with mean determined by the 
original number of documents, up to a maximum of 200. Before performing 
the Sinkhorn normalization, a small constant ($\approx 10^{-6}$) was added to each
entry in the matrix. Five Sinkhorn iterations were performed. When specific 
rank predictions were made at test time, the short-cut Hungarian method 
was used as described in Section 2.3, with P = 200. 

The results for the seven different data sets are shown in Figure 2. 
Several publicly available baselines 2 are shown for comparison: AdaRank 
[13], ListNet [14], SmoothRank [6] and basic regression. In each figure, the 
testing NDCG score is shown as a function of truncation level. As can be 
seen in the graphs, the Sinkhorn approach is generally competitive with the 
state-of-the-art. On the TD2003 data set, the Sinkhorn normalization appears to
offer a substantial advantage. 

6. Related Work 

SinkProp builds on a rapidly expanding set of approaches to rank learn- 
ing. Early learning-to-rank methods employed surrogate gain functions, 
as approximations to the target evaluation measure were necessary due 
to its aforementioned non-differentiability. More recently, methods have 
been developed to optimize expectation of the target evaluation measure, 
including SoftRank [4], BoltzRank [15] and SmoothRank [6]. These methods
all attempt to maximize the expected gain of the ranking under any of 
the gain functions described above. The crucial component of each is the 
estimate of the distribution over rankings: SoftRank uses a rank-binomial 
approximation, which entails sampling and sorting ranks; BoltzRank uses
a fixed set of sampled ranks; and SmoothRank uses a softmax on rank- 
ings based on a noisy model of scores. SinkProp can be viewed as another 
method to optimize expected ranking gain, but the effect of the scaling is 
to concentrate the mass of the distribution over ranks on a small set, which 
peaks on the single chosen rank selected by the model at test time. 

Sinkhorn scaling itself is a long-standing method with a wide variety 
of applications, including discrete constraint satisfaction problems such as 
Sudoku [16] and for updating probabilistic belief matrices [17]. It has also 
been used as a method for finding or approximating the matrix permanent, as
the permanent of the scaled matrix is the permanent of the original matrix times
the product of the entries of the row and column scaling vectors [18]. Recently, Sinkhorn
normalization has been employed within a regret-minimization approach for 
on-line learning of permutations [19]. 

Although this work represents the first approach to ranking that has 
directly incorporated Sinkhorn normalization into the training procedure, 
previously-developed methods have also found it useful. The SoftRank 
algorithm [4], for example, uses Sinkhorn balancing at test-time, as a post- 
processing step for the approximated ranking matrix. Unlike the approach 
proposed here, however, it does not take this step into account when op- 
timizing the objective function. SmoothRank uses half a step of Sinkhorn 
balancing, normalizing only the scores within each column of the matrix, to 
produce a distribution over items at a particular rank. 

7. Discussion and Future Work 

In this paper we have presented a new way to optimize learning algorithms 
under ranking objectives. The conceptual centerpiece of our approach is 
the observation that the expectations of certain kinds of popular ranking 
objectives — ones we have dubbed rank-linear — can be evaluated ex- 
actly even if only marginal distributions are available. This means that 
the expected value of these ranking objectives can be computed from the 
doubly-stochastic matrix that arises from their location within the Birkhoff 
polytope. To actually learn the appropriate doubly-stochastic matrices, it 
is possible to apply a well-studied iterative projection operator known as 
a Sinkhorn normalization. Remarkably, gradient-based learning can still 
be efficiently performed in the unnormalized space by backpropagating 
through the iterative procedure, which we call SinkProp. We demonstrated 
the effectiveness of this new approach by applying it to seven information 
retrieval data sets.

There are several promising future directions for work in this area. In 
the context of ranking, the ability to backpropagate gradients of rank-linear 
objectives enables wide possibilities in retrieval of non-text documents. In 
particular, there have been significant recent advances in gradient-based 
training of neural networks for discrimination of images and speech (e.g., 
[20, 21]). The SinkProp approach could enable these networks to be trained to 
produce permutations over these more general types of objects. More broadly, 
we have shown that it is practical to backpropagate training gradients 
through iterative projection operators. Such operators can be defined for a
variety of structured outputs: matching problems and image correspondence 
tasks, for example. Finally, while the notion of rank-linearity is specific 
to the problem of learning permutations, it seems likely that it can be 
expanded to other types of structured-prediction tasks, leading to efficient 
computation of expected gains under DSM-like representations. 